this report provides a final update and program summary.
1 Introduction


The  goal  of  this  research  program  was  to  obtain  new  solutions  to  fundamental 
problems  in  computer  vision,  particularly  to  such  problems  as  stereo  compilation, 
feature  extraction,  and  general  scene  modeling  that  are  relevant  to  the  development 
of  an  automated  capability  for  interpreting  aerial  imagery  and  the  production  of 
cartographic  products. 

To achieve this goal, we engaged in investigations of such basic issues as image
matching, scene partitioning, shape representation, and physical modeling. However,
it is obvious that high-level high-performance vision requires the use of both
intelligence and stored knowledge (to provide an integrative framework), as well as
an understanding of the physics and mathematics of the imaging process (to provide
the basic information needed for reasoned interpretation of the sensed data).
Thus, a significant portion of our work was devoted to developing new approaches
to the problem of “knowledge-based vision.” Finally, vision research cannot proceed
without a means for effective implementation, demonstration, and experimental
verification of theoretical concepts; we have developed an environment incorporating
some of the newest and most effective computing tools currently available.

The  research  results  described  in  this  final  report  are  partitioned  into  three  topic 
areas:  (1)  three-dimensional  scene  modeling  and  stereo  reconstruction;  (2)  feature 
extraction:  scene  partitioning,  semantic  labeling,  and  the  representation  of  natural 
scenes;  and  (3)  computing  environments  and  technology  transfer. 

1.1  Three-Dimensional  Scene  Modeling  and  Stereo 
Reconstruction 

Our  goal  in  this  research  area  was  to  develop  automated  methods  for  producing 
a  3-D  scene  model  from  several  images  recorded  from  different  viewpoints.  The 
standard  approach  to  this  problem  is  to  use  stereo  compilation  —  a  technique  that 
involves  finding  pairs  of  corresponding  scene  points  in  two  images  (which  depict  the 
scene  from  different  spatial  locations)  and  using  triangulation  to  determine  scene 
depth [Barnard 82]. Various factors associated with viewing conditions and scene
content  can  cause  the  matching  process  to  fail;  these  factors  include  occlusion, 
projective  or  imaging  distortion,  featureless  areas,  and  repeated  or  periodic  scene 
structures.  In  this  section  we  discuss  some  of  the  ways  we  devised  for  dealing  with 
these  problems:  more  effective  methods  for  image  matching,  interpolation  for  filling 
in  “holes”  caused  by  matching  failure,  and  some  exciting  and  radically  new  methods 
for  3-D  modeling  which  avoid  the  need  for  local  matching. 


1.1.1  Baseline  Stereo  System 

We  have  implemented  a  complete  state-of-the-art  stereo  system  (STEREOSYS)  to 
produce  dense  three-dimensional  data  from  stereo  pairs  of  intensity  images  using 
automated  area-based  stereo  compilation.  This  system  operates  in  several  passes 
over  the  data,  during  which  it  iteratively  builds,  checks,  and  refines  its  model  of  the 
three-dimensional  world,  as  originally  represented  by  the  pair  of  images. 

Our  research  strategy  had  been  to  implement  a  baseline  system  that  performs 
conventional  stereo  compilation,  then  to  replace  pieces  of  the  system  with  improved 
modules as we developed them. We evaluated new developments by testing the “updated”
baseline system [Hannah 85a] against a “challenge data base” [Hannah 85b]
of  image  areas  where  conventional  stereo  techniques  encounter  difficulty. 

Our  system  includes  routines  to  perform  the  following  operations  automatically: 

•  Construct  resolution  hierarchies  for  stereo  images. 

•  Select  “interesting”  points  for  sparse  matching. 

•  Search  2-D  regions  for  corresponding  points  (sparse  matching). 

•  If  necessary  for  uncalibrated  imagery,  compute  relative  camera  parameters 
from  sparse  matches. 

•  Compute  epipolar  lines. 

•  Locate  epipolar  matches,  using  disparity  estimates  from  sparse  matches  when 
available. 

•  Evaluate  matched  points  for  believability. 

•  Interpolate  between  matched  points. 

• Display images and results in left-right stereo, red-green stereo, or as a monocular
disparity field.

•  Compute  range  data  and  x-y-z  coordinates  for  matched  point  pairs. 

•  Display  terrain  data  in  perspective  with  hidden  lines  removed. 

(For a more complete description of the components of STEREOSYS, see
[Hannah 85c] or [Hannah 86].)
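
The first operation in the list above, constructing a resolution hierarchy, can be sketched as follows. This is only a minimal illustration: the 2x2 block averaging and the function names are assumptions, since this report does not state how STEREOSYS builds its hierarchies.

```python
def downsample_2x(image):
    """Halve the resolution by averaging each 2x2 block.

    `image` is a list of equal-length rows of numbers. An odd trailing
    row or column is dropped (an assumption made for brevity).
    """
    rows, cols = len(image) // 2, len(image[0]) // 2
    return [
        [
            (image[2 * r][2 * c] + image[2 * r][2 * c + 1]
             + image[2 * r + 1][2 * c] + image[2 * r + 1][2 * c + 1]) / 4.0
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def build_pyramid(image, min_size=2):
    """Return [full resolution, half, quarter, ...] down to min_size pixels per side."""
    levels = [image]
    while len(levels[-1]) // 2 >= min_size and len(levels[-1][0]) // 2 >= min_size:
        levels.append(downsample_2x(levels[-1]))
    return levels
```

In a coarse-to-fine scheme of this kind, a disparity found at one level would be doubled to seed the search at the next finer level.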

Precise  quantitative  evaluation  of  the  accuracy  of  STEREOSYS  was  difficult 
because  we  were  not  able  to  obtain  stereo  data  sets  with  known  ground  truth 
with which to compare our results. We did, however, have the results of an interactive
stereo compilation algorithm, the Digital Interactive Mapping Program
(DIMP), produced and operated by the U.S. Army Engineer Topographic Laboratories
(ETL) [Norvelle 81]. Detailed comparisons of results on this and other
data  sets  are  presented  in  [Hannah  85a].  Overall,  we  have  found  that  the  results  of 
STEREOSYS  agree  quite  well  with  DIMP  results  and  with  human  perceptions.  In 
addition,  STEREOSYS  remedies  some  of  the  obvious  problems  we  had  seen  with 
existing  systems,  such  as  DIMP’s  tendency  to  extrapolate  itself  off  track  —  and  of 
course,  STEREOSYS  is  fully  automatic,  while  comparable  systems,  such  as  DIMP, 
require  interactive  operation. 

Overall,  we  have  found  that  STEREOSYS  performs  very  well  on  the  low-resolution 
aerial  imagery  for  which  it  was  designed.  It  has  also  been  applied  to  narrow-angle 
ground-based imagery with reasonable results. STEREOSYS has difficulties when
processing areas that violate its premises about the continuity of the world, but experiments
linking it to an edge-based matcher [Baker 82] appear to solve the most
severe  versions  of  these  problems. 

STEREOSYS  has  been  used  extensively  at  SRI  and  is  suitable  for  transport  to 
other VAX systems (however, this is research code with the corresponding limitations).

1.1.2  New  Methods  for  Stereo  Compilation 

The  conventional  approach  to  recovering  scene  geometry  from  a  stereo  pair  of  images 
is based on the matching of distinctive scene features, as well as on the satisfaction
of constraints imposed by the viewing geometry (e.g., the epipolar constraint).
Typically, three steps are required: determination of the relative orientation of the
two  images,  computation  of  a  sparse  depth  map,  and  derivation  of  a  dense  depth 
map  for  the  given  scene. 

In the first step, points corresponding to unmistakable scene features are identified
in each of the images. The relative orientation of the two images is then
calculated  from  these  points.  This  is,  in  part,  an  unconstrained  matching  task. 
Corresponding  image  features  must  be  found.  Without  a  priori  knowledge,  such  a 
matching  procedure  knows  neither  the  approximate  location  (in  the  second  image) 
of  a  feature  found  in  the  first  image,  nor  the  appearance  of  that  feature.  However, 
it  is  often  the  case  that  appearance  will  vary  little  between  images  and  that  the 
images  were  taken  from  similar  positions  relative  to  the  scene. 

Recovery of the relative orientation of the images reduces the two-dimensional
matching problem to a one-dimensional one; a scene feature identified
in the first image is found by a one-dimensional search along an (epipolar) line
in the second image. Identification of this feature in the second image makes it
possible to calculate the feature’s disparity, and hence its relative scene depth.
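
As a minimal worked example of the step from disparity to depth, consider the special case of a rectified stereo pair with parallel optical axes. That geometry is an assumption made here for simplicity (the report does not restrict itself to it), and the function and parameter names are hypothetical. By similar triangles, depth is Z = f * B / d for focal length f, baseline B, and disparity d.

```python
def depth_from_disparity(disparity_px, focal_length_px, baseline):
    """Depth of a scene point for a rectified, parallel-axis stereo pair.

    disparity_px and focal_length_px are in pixels; the result has the
    same unit as `baseline`.
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a point in front of both cameras")
    return focal_length_px * baseline / disparity_px
```

Because depth varies as 1/d, a fixed error in measured disparity produces a depth error that grows with the square of the depth, which is one reason sub-pixel disparity accuracy matters.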

Identification  of  corresponding  points  in  the  two  images  is  typically  based  on 
correlation  techniques.  Area-based  correlation  processes  may  be  applied  directly 
to  the  raw  image  irradiances  or  to  images  that  have  been  preprocessed  in  some 
manner.  Edges  (e.g.,  as  identified  by  the  zero  crossings  of  the  Laplacian  of  their 
image  irradiances)  have  also  been  used  to  obtain  correspondences. 
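
The following sketch illustrates area-based correlation as a one-dimensional search. It assumes rectified images, so that the epipolar line coincides with a scanline, and it uses normalized cross-correlation as the similarity measure; both choices, and all names, are illustrative assumptions rather than a description of any system cited in this report.

```python
import math


def ncc(a, b):
    """Normalized cross-correlation of two equal-length windows, in [-1, 1]."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if denom == 0.0:
        return 0.0  # featureless window: correlation undefined, treated as no evidence
    return sum(x * y for x, y in zip(da, db)) / denom


def best_match_on_scanline(left_row, right_row, x_left, half_width, max_disparity):
    """Find the disparity whose right-image window best matches the left window.

    Assumes x_left is at least half_width pixels from both ends of left_row.
    """
    window = left_row[x_left - half_width: x_left + half_width + 1]
    best_score, best_d = -2.0, None
    for d in range(max_disparity + 1):
        x_right = x_left - d
        if x_right - half_width < 0:
            break  # candidate window would fall off the image
        candidate = right_row[x_right - half_width: x_right + half_width + 1]
        score = ncc(window, candidate)
        if score > best_score:
            best_score, best_d = score, d
    return best_d, best_score
```

The zero-denominator branch makes the failure mode on featureless areas explicit: every candidate scores the same, so no disparity is preferred.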

The  outcome  of  this  second  step  is  a  sparse  map  of  the  scene’s  relative  depth  at 
those  points  that  were  identified  in  both  images  of  the  stereo  pair. 

A sparse depth map does not define the scene topography. The third and final
step in recovering the topography of the scene is “filling in” this sparse map
to obtain a dense depth map of the scene. Typically, a surface interpolation or
approximation method is used as a means for calculating the dense depth map
from its sparse counterpart [Smith 84b]. A surface approximation model may be
formulated  to  provide  desirable  image  properties  (such  as  the  lack  of  additional 
zero  crossings  —  in  the  Laplacian  of  the  image  irradiances  —  that  are  artifacts  of 
the  surface  approximation  model),  but  often  the  surface  model  is  based  on  a  priori 
requirements  for  the  fitted  surface,  such  as  smoothness. 

The problems encountered in the first two steps (recovery of the relative orientation
of the images and computation of the sparse depth map) are dominated
by the problems of image matching. False matches that arise from repetitive scene
structures, such as windows of a building, or from image features that are not
distinctive (at least, on the basis of local evidence) occur more frequently in the
unconstrained matching environment than in the constrained environment. In recovering
the relative orientation of the images, we can use redundant information
in  an  effort  to  reduce  the  influence  of  false  matches;  this  is  more  difficult  in  the 
case  when  the  sparse  depth  map  is  computed.  Furthermore,  we  have  little  choice  as 
to  which  features  we  may  use  for  sparse  depth  mapping;  if  we  choose  not  to  use  a 
feature,  we  cannot  recover  the  relative  depth  at  that  scene  point  (without  invoking 
semantic  or  contextual  knowledge). 

The selection of suitable features for determining image correspondence is difficult
in itself. Correlation techniques embed assumptions that are often violated
by the best image features. Area-based correlation techniques usually reflect the
premise that image patches are of a scene structure that is positioned at one distinct
depth, whereas edges that arise at an object’s boundaries are surrounded by
surfaces at different scene depths. Edge-based techniques are based on the assumption
that an edge found in one image is not “moved” by the change in viewing
position of the second image, whereas zero crossings found at boundaries of objects
whose surface gradients are tangential to the line of sight contradict this assumption.
These would seem minor problems, were it not for the accuracy required of
the  matching  process.  Typically,  the  spatial  resolution  of  disparity  measurements 
must  be  an  order  of  magnitude  better  than  the  image’s  spatial  resolution.  Stereo 
matching  appears  to  require  features  with  properties  that  are  often  incompatible 
with  what  is  practical  in  realistic  situations. 

The  third  step,  derivation  of  a  dense  depth  map  from  a  sparse  one,  is  still  far 
short of having an adequate solution. Most approaches employ “blind” interpolation,
since no effective methods are currently in use for extracting depth from the
irradiance  data  in  the  individual  images  of  the  stereo  pair  (although  some  of  the 
new  work  described  in  the  next  section  might  alter  this  situation). 

In  summary,  we  see  that  the  most  demanding  steps  in  the  stereo  compilation 
process  are  the  final  two:  computation  of  a  sparse  depth  map,  and  derivation  of 
its  dense  counterpart.  We  have  developed  a  new  approach  to  stereo  compilation 
which  involves  combining  these  steps  to  recover  a  dense  relative-depth  map  of  the 
scene directly from the image data [Smith 85 & 86]. We use image irradiance
profiles  as  input  to  an  integration  procedure  that  returns  the  corresponding  dense 
relative-depth  profile.  This  procedure  does  not  match  image  points  (at  least,  not 
in  the  conventional  sense),  nor  does  it  “fill  in”  data  to  obtain  the  dense  depth  map. 
It  avoids  the  need  to  make  the  restrictive  assumptions  usually  required  for  stereo 
image  matching,  and  it  directly  uses  the  image  irradiance  data  in  recovering  the 
dense  depth  map. 

Other innovative approaches to stereo compilation that we have developed include:

(a) A technique [Quam 84] that merges matching and interpolation in the context
of  a  coarse-to-fine  hierarchical  control  structure;  one  of  the  images  is  geometrically 
warped  to  improve  the  performance  of  a  cross-correlation-based  matching  technique. 
A  surface  interpolation  algorithm  [Smith  84b]  is  used  to  fill  holes  whenever  the 
matching  operator  fails. 

(b)  A  stochastic  optimization  approach  [Barnard  86]  that  provides  a  dense  array 
of  disparities  without  the  need  for  interpolation.  It  uses  a  simulated  annealing 
algorithm  to  find  a  3-D  model  which  best  satisfies  the  goals  of  matching  points 
with  similar  intensities  while  ensuring  that  the  resulting  surfaces  are  as  smooth  as 
possible. 
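
To make the idea in (b) concrete, the sketch below applies simulated annealing to a single scanline. It is a toy illustration only: the energy terms, the cooling schedule, and every name and parameter value are assumptions chosen for brevity, not the formulation of [Barnard 86].

```python
import math
import random


def energy(left, right, disp, smooth_weight):
    """Intensity mismatch plus a penalty on disparity jumps between neighbors."""
    data = 0.0
    for x, d in enumerate(disp):
        xr = x - d
        # A pixel that maps outside the right scanline gets a fixed penalty.
        data += abs(left[x] - right[xr]) if 0 <= xr < len(right) else 255.0
    smooth = sum(abs(disp[x] - disp[x + 1]) for x in range(len(disp) - 1))
    return data + smooth_weight * smooth


def anneal_scanline(left, right, max_disp, smooth_weight=1.0,
                    t_start=50.0, t_end=0.1, cooling=0.95, sweeps=20, seed=0):
    """Return (best disparity list, its energy) found by Metropolis annealing."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    disp = [0] * len(left)
    current = energy(left, right, disp, smooth_weight)
    best, best_energy = list(disp), current
    t = t_start
    while t > t_end:
        for _ in range(sweeps * len(left)):
            x = rng.randrange(len(left))
            old = disp[x]
            disp[x] = rng.randint(0, max_disp)  # propose a new disparity at x
            proposed = energy(left, right, disp, smooth_weight)
            delta = proposed - current
            # Always accept improvements; accept worsening moves with
            # probability exp(-delta / t), which shrinks as t cools.
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                current = proposed
                if current < best_energy:
                    best, best_energy = list(disp), current
            else:
                disp[x] = old  # reject the move
        t *= cooling
    return best, best_energy
```

Every pixel receives a disparity, so the result is dense without a separate interpolation step; the smoothness weight controls the trade-off between the two goals named above.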

1.1.3 New Methods for 3-D Modeling That Do Not
Depend on Stereo Correspondence

We have noted that it will not always be possible to find corresponding
scene points in the two images of a conventional stereo pair, and yet, to recover a
dense scene model, we need to determine the depth at every scene point. Because
interpolation  will  not  always  provide  an  acceptable  answer  when  matching  fails,  we 
have  developed  a  number  of  new  techniques  for  recovering  scene  depth  which  do 
not  require  establishing  stereo  correspondence. 

A significant body of work exists in the area of extracting depth from the shading
and texture visible in a single image (e.g., [Smith 83a & 83b] and [Pentland 84a]);
these different techniques make a variety of distinct assumptions about the nature
of the scene, the illumination, and the imaging geometry. In Strat and Fischler
[85], we show that the distinct assumptions that are used by each of the different
schemes must be equivalent to providing a second (virtual) image of the original
scene, and that all of these different approaches can be translated into a conventional
stereo formalism. In particular, we show that it is frequently possible to
structure  the  problem  as  that  of  recovering  depth  from  a  stereo  pair  consisting  of  a 
conventional  perspective  image  (i.e.,  the  original  image)  and  an  orthographic  image 
(the  virtual  image);  we  provide  a  new  algorithm  needed  to  accomplish  this  type  of 
stereo  reconstruction  task. 

In  Pentland  [85a]  we  show  how  focal  gradients  (image  “blur”),  which  result  from 
the  limited  depth  of  field  inherent  in  most  optical  systems,  can  be  used  to  recover 
scene depth. The advantages of this technique are that it is fast and computationally
simple, makes no special assumptions about the scene, and avoids the stereo
matching problem. Mathematical analysis and experiments indicate that the accuracy
achievable by this technique is comparable to what can be expected from the
use  of  stereo  disparity  or  motion  parallax  to  determine  scene  depth. 
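
The geometric relationship underlying depth from focal gradients can be illustrated with the textbook thin-lens model. The sketch below is an assumption-laden illustration, not the method of [85a]: it assumes an ideal thin lens, a point farther away than the in-focus plane, and a blur-circle diameter that has already been measured; the names are hypothetical. From 1/u + 1/v = 1/F and a blur circle of diameter c = A (v0 - v) / v for aperture diameter A, the object distance is u = F v0 / (v0 - F - c N), where N = F / A is the f-number.

```python
def depth_from_blur(blur_diameter, focal_length, sensor_distance, f_number):
    """Thin-lens distance to a point lying beyond the in-focus plane.

    All lengths share one unit. blur_diameter is the diameter of the
    blur circle on the sensor; sensor_distance is the lens-to-sensor
    distance v0.
    """
    denom = sensor_distance - focal_length - blur_diameter * f_number
    if denom <= 0:
        raise ValueError("blur too large for this lens model (point at or beyond infinity)")
    return focal_length * sensor_distance / denom
```

A point nearer than the in-focus plane produces the same blur diameter with the opposite sign of (v0 - v), so the side of focus must be known or assumed.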

For  most  purposes  concerned  with  the  analysis  of  imaged  data,  determination 
of  an  array  of  depths  (e.g.,  as  obtained  by  conventional  stereo  methods)  is  only 
the  first  step  in  the  construction  of  a  scene  description.  The  conventional  approach 
next  compiles  largely  continuous  surfaces  from  the  discrete  depth  information,  and 
then  attempts  to  partition  these  surfaces  into  coherent  3-D  objects.  Aside  from 
some  still  unsolved  theoretical  problems,  this  process  is  computationally  expensive 
and  time  consuming.  In  Bolles  and  Baker  [85],  we  describe  a  new  method  for  using 
camera  motion  through  a  scene  to  obtain  a  3-D  model  in  which  higher-level  scene 
attributes  are  directly  accessible. 

This  technique  is  based  on  considering  a  dense  sequence  of  images  as  forming  a 
solid  block  of  data.  Slices  through  this  solid  at  appropriately  chosen  angles  intermix 
time  and  spatial  data  in  such  a  way  as  to  simplify  the  partitioning  problem:  These 
slices  have  more  explicit  structure  than  the  conventional  images  from  which  they 
were  obtained.  We  believe  that  this  work  is  a  very  important  development;  it  offers 
a  completely  new  and  direct  method  for  accessing  information  about  scene  objects 
without  requiring  a  completely  bottom-up  analysis  process. 
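
The sketch below shows what such a slice looks like in the simplest case. It assumes a camera translating sideways at constant speed, so that each scene point shifts by a constant number of pixels per frame; this motion model, the synthetic data, and the function names are illustrative assumptions and are not taken from Bolles and Baker [85].

```python
def synthetic_frames(num_frames, width, height, points):
    """Build a sequence of binary images containing moving point features.

    points is a list of (x0, row, shift_per_frame); nearer points are
    given larger shifts, as lateral camera motion would produce.
    """
    frames = [[[0] * width for _ in range(height)] for _ in range(num_frames)]
    for t in range(num_frames):
        for x0, row, shift_per_frame in points:
            x = x0 - shift_per_frame * t
            if 0 <= x < width:
                frames[t][row][x] = 1
    return frames


def epipolar_slice(frames, row):
    """Cut the image block along time: result[t][x] = frames[t][row][x]."""
    return [list(frame[row]) for frame in frames]
```

In the resulting time-by-position slice each point traces a straight line whose slope depends on its shift per frame, and hence on its depth, so features can be grouped by fitting lines rather than by matching frame to frame.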


1.2  Feature  Extraction:  Scene  Partitioning,  Semantic 
Labeling,  and  the  Representation  of  Natural  Scenes 

Creating a scene description from a photographic image requires the ability to perform
two basic operations: (a) partitioning the image into independent or coherent
pieces,  and  (b)  assigning  names  or  semantic  labels  to  these  pieces. 

1.2.1  Scene  Partitioning 

The  partitioning  operation,  necessary  to  reduce  the  computational  complexity  of 
the  subsequent  scene  analysis  steps,  has  proven  to  be  extremely  difficult  —  the 
performance  of  automated  systems  is  still  far  inferior  to  that  of  humans.  In  part, 
this  disparity  in  performance  is  due  to  the  fact  that  humans  appear  to  use  contextual 
knowledge and past experience in such tasks, while most available computational
techniques  employ  only  the  local  intensity  patterns  visible  in  the  image,  i.e.,  they 
perform  “syntactic  partitioning.”  For  practical  as  well  as  theoretical  reasons,  we 
have  carried  out  an  investigation  to  determine  the  competence  limits  of  a  purely 
syntactic  approach  to  partitioning  and,  simultaneously,  to  construct  an  operational 
system  that  approaches  these  limits.  This  investigation  has  resulted  in  a  very  high 
performance  system  described  in  a  paper  by  Laws  [85]. 

Barnard  [84b]  describes  one  of  a  number  of  investigations  that  attempt  to  provide 
a  theoretical  basis  for  the  partitioning  process.  In  this  paper,  Barnard  explores  the 
idea  that  partitioning  decisions  result  in  alternative  descriptions  of  a  scene,  and 
that  the  preferred  partitioning  is  the  one  that  provides  the  “simplest”  description. 
In  a  paper  by  Fischler  and  Bolles  [83],  partitioning  is  viewed  as  an  explanation 
of  how  the  image  is  related  to  the  scene  from  which  it  was  derived;  it  is  shown 
that  completeness  and  stability  of  explanation,  as  well  as  simplicity,  are  necessary 
partitioning  criteria,  since  these  attributes  are  necessary  for  an  explanation  to  be 
believable. 

1.2.2  Feature  Delineation  and  Semantic  Labeling 

In Fua and Hanson [85, 86a & 86b], we describe an approach to the problem of converting
a syntactically partitioned image (e.g., one provided by Laws’ segmentation
system)  into  a  semantic  description.  This  work  has  produced  a  system  that  can 
extract  cultural  objects  from  aerial  imagery;  it  uses  geometric  reasoning  to  identify 
semantically  significant  arrangements  of  straight  line  segments  in  the  borders  of  the 
supplied  partition.  Emphasis  is  placed  on  using  generic  models  that  characterize 
significant kinds of geometric relationships and shapes, thereby avoiding the well-known
drawbacks inherent in the use of specific object templates. An important
feature of this system is the generation of an explanation for any detected discrepancy
between the hypothesised object models and the initial partition. In principle,
this technique should ultimately permit intelligent compensation for anomalies due
to imaging or environmental effects that would be recognized by a well-briefed human
analyst; for example, on the basis of illumination effects consistent with the
known  sun  position,  the  system  can  identify  two  contrasting  regions  of  a  peaked 
roof  as  belonging  to  a  single  house. 


1.2.3  The  Representation  and  Recognition  of  Natural  Forms 

Our  research  in  this  area  addressed  two  related  problems:  (1)  representing  natural 
shapes such as mountains, vegetation, and clouds, and (2) computing such descriptions
from image data. A key step toward solving these problems is to obtain a
model  of  natural  surface  shapes. 

A  model  of  natural  surfaces  is  extremely  important  because  we  face  problems 
that seem impossible to address with standard descriptive computer-vision techniques.
How, for instance, can we describe and recognize the shape of leaves on
a tree? Or grass? Or clouds? When we attempt to describe such common natural
shapes using standard representations, the result is a model of impractical
complexity. 

Furthermore,  how  can  we  extract  3-D  information  from  the  image  of  a  textured 
surface when we have no models that describe natural surfaces and how they evidence
themselves in the image? The lack of such a 3-D model has restricted image
texture  descriptions  to  being  ad  hoc  statistical  measures  of  the  image  intensity 
surface. 

Fractal  functions,  a  novel  class  of  naturally  arising  functions,  are  a  good  choice 
for  modeling  natural  surfaces  because  many  basic  physical  processes  (e.g.,  erosion 
and  aggregation)  produce  a  fractal  surface  shape,  and  because  fractals  are  widely 
used  as  a  graphics  tool  for  generating  natural-looking  shapes.  Additionally,  in  a 
survey  of  natural  imagery,  we  found  that  a  fractal  model  of  imaged  3-D  surfaces 
furnishes  an  accurate  description  of  both  textured  and  shaded  image  regions,  thus 
providing  justification  for  the  use  of  this  physics-derived  model. 

Progress  relevant  to  computing  3-D  information  from  imaged  data  has  been 
achieved  by  use  of  the  fractal  model.  A  test  has  been  derived  to  determine  whether 
the  fractal  model  is  valid  for  a  particular  set  of  image  data,  an  empirical  method  for 
computing  surface  roughness  from  image  data  has  been  developed,  and  substantial 
progress has been made in the areas of shape-from-texture and texture segmentation.
Characterization of image texture by means of a fractal surface model has
also shed considerable light on the physical basis for several of the texture partitioning
techniques currently in use, and made it possible to describe image texture
in  a  manner  that  is  stable  over  transformations  of  scale  and  linear  transforms  of 
intensity.  The  computation  of  a  3-D  fractal-based  representation  from  actual  image 
data  has  been  demonstrated.  This  work  has  shown  the  potential  of  a  fractal-based 
representation  for  efficiently  computing  good  3-D  representations  for  a  variety  of 
natural  shapes,  including  such  seemingly  difficult  cases  as  mountains,  vegetation, 
and  clouds. 
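
One common way to estimate roughness under a fractal model is sketched below, as an illustration only; it is not necessarily the empirical method referred to above. It assumes a fractional-Brownian-style scaling law in which the mean absolute intensity difference at lag dx grows as dx^H, and estimates the exponent H as the slope of a log-log fit. For a profile obeying this law the fractal dimension is 2 - H, so smaller H means a rougher profile. The function name and the choice of lags are hypothetical.

```python
import math


def estimate_hurst_exponent(profile, lags=(1, 2, 4, 8)):
    """Least-squares slope of log(mean |difference|) against log(lag)."""
    xs, ys = [], []
    for lag in lags:
        diffs = [abs(profile[i + lag] - profile[i])
                 for i in range(len(profile) - lag)]
        mean_diff = sum(diffs) / len(diffs)
        if mean_diff > 0:  # a zero mean difference has no logarithm; skip that lag
            xs.append(math.log(lag))
            ys.append(math.log(mean_diff))
    if len(xs) < 2:
        raise ValueError("need at least two usable lags to fit a slope")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator
```

How closely the log-log points follow a straight line gives an informal check on whether the scaling assumption is reasonable for a given image region.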

Finally, the fractal model of surface shape has been used in a technique for 3-D
shape estimation that treats shading and texture in a unified manner [Pentland 84a].
Previously, shape-from-shading and texture methods have had the serious drawback
that they are applicable only to smooth surfaces, while real surfaces are often
rough and crumpled. We have extended one class of such methods to more realistic
surfaces by using the fractal surface model.

We have constructed a representational system that combines the fractal functions
described above, for use in describing 3-D texture, and superquadric functions
(defined  below)  for  describing  shape  in  a  concise  and  natural  manner. 

The  idea  behind  this  representational  system  is  to  provide  a  vocabulary  of  shapes 
and transformations that will allow us to model an object world as a relatively simple
composition of component “parts,” much as people seem to do. The most
primitive notion in this representation may be thought of as analogous to a “lump of
clay,” a modeling primitive that may be deformed and shaped, but that is intended
to correspond roughly to our naive perceptual notion of “a part.” For this basic
modeling element we use a parameterized family of shapes known as superquadrics.
This family of functions includes cubes, cylinders, spheres, diamonds, and pyramidal
shapes as well as the round-edged shapes intermediate between these standard
shapes. Superquadrics are, therefore, a superset of the modeling primitives currently
in common use.
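
For readers unfamiliar with the family, the sketch below evaluates the standard superellipsoid inside-outside function from the graphics literature, F = (|x/a1|^(2/e2) + |y/a2|^(2/e2))^(e2/e1) + |z/a3|^(2/e1). This particular parameterization is an assumption, since this section does not give the formula used in our system. F < 1 inside the shape, F = 1 on its surface, and F > 1 outside.

```python
def superquadric_inside_outside(x, y, z, a1, a2, a3, e1, e2):
    """Evaluate the superellipsoid inside-outside function at (x, y, z).

    a1, a2, a3 are the half-extents along each axis; e1 and e2 are
    positive shape exponents (1 gives an ellipsoid, values near 0 give
    a box-like shape, 2 gives a diamond-like shape).
    """
    xy_term = (abs(x / a1) ** (2.0 / e2) + abs(y / a2) ** (2.0 / e2)) ** (e2 / e1)
    return xy_term + abs(z / a3) ** (2.0 / e1)
```

With e1 = e2 = 1 and unit half-extents this reduces to x^2 + y^2 + z^2, the unit sphere; lowering both exponents toward 0 squares off the shape, so a point near a cube corner moves from outside to inside.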

These  basic  “lumps  of  clay”  are  used  as  prototypes  that  are  then  deformed  by 
stretching, bending, twisting or tapering, and then combined using Boolean operations
to form new, complex prototypes that may, recursively, again be subjected to
deformation and Boolean combination [Pentland 86b, 86c & 86d].

We  have  also  made  significant  progress  toward  the  reliable  recovery  of  these 
modeling  primitives  from  image  data.  We  have  developed  theoretical  results  that 
show  how  such  descriptive  primitives  may  be  recovered  in  an  overconstrained,  and 
therefore  reliable,  manner  [Pentland  86e]. 

This  research  has  contributed  to  the  development  of  a  computational  theory  of 
vision applicable to natural surface shapes, compact representations of shape useful
for describing natural surfaces, and real-time regeneration and display of natural
scenes. 


2  Computing  Environments  and  Technology 
Transfer 

Machine  vision  is  largely  an  experimental  science;  progress  in  this  science  depends 
on  having  available  massive  amounts  of  computing  power,  and  on  methods  for 
manipulating  and  displaying  images,  their  transformations,  and  depictions  of  the 
corresponding  scene  content.  Technology  transfer  must  often  be  in  the  form  of 
machine  code  that  can  run  in  a  compatible  computing  environment. 

As  part  of  our  previously  described  work,  we  have  built  a  research  environment 
based  on  the  VAX-11/780  computer  and  have  made  this  environment  available 
to  appropriate  university  and  government  institutions.  This  environment  includes 
standard  utilities  (e.g.,  low-level  image  operators)  and  advanced  scene-modeling 
and  recognition  techniques  (e.g.,  the  Hannah  Baseline  Stereo  System,  the  Laws 
Segmentation System, and the Generalized Hough Transform).

More  recently,  we  have  constructed  a  powerful  computing  environment  based  on 
the  Symbolics  3600  series  of  LISP  machines  and  an  SRI  product  called  Image-Calc. 
Using  this  environment  as  a  substrate,  we  have  developed  a  number  of  scene  analysis 
and  rendering  systems.  For  example,  a  system  called  Terrain-Calc  [Quam  85]  can  be 
used  to  synthesize  realistic  sequences  of  perspective  stereo  views  of  real-world  terrain 
from stored geometric and photometric models. This system has a sophisticated
graphical  interface  that  allows  the  user  to  specify  an  arbitrary  flight  path  over 
a  modeled  piece  of  terrain.  A  sequence  of  views  (single  images  or  stereo  pairs, 
as  desired),  spaced  at  equal  distances  along  the  flight  path,  can  be  generated  at 
about  1  frame/minute;  up  to  60  frames  can  be  displayed  at  16  frames/second.  This 
system  is  revolutionary  in  its  flexibility,  computational  efficiency,  and  the  quality 
of  the  renderings  it  produces  —  given  that  it  does  not  employ  any  special-purpose 
hardware. 

The  computational  demands  of  practical  machine-vision  applications  frequently 
exceed  the  capacity  of  conventional  serial  computer  architectures.  For  this  reason, 
attempts have been made to reduce computation time by decomposing serial algorithms
into segments that can be executed simultaneously on parallel hardware
architectures.  Because  many  classes  of  algorithms  do  not  readily  decompose,  one 
seeks  some  other  basis  for  parallelism.  We  have  investigated  the  use  of  techniques 
that  exhibit  natural  parallelism.  For  example,  in  Fischler  and  Firschein  [87]  we 
show  that  “guessing”  the  answer  to  a  problem  and  then  checking  its  validity  is  a 
useful approach, and that a number of important vision algorithms can be viewed as
having  this  structure;  a  parallel  architecture  capable  of  executing  such  algorithms 
is  described. 
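
A small example of the guess-and-check structure is line fitting in the presence of gross outliers: repeatedly guess a line from two randomly chosen points, then check how many points agree with it. The sketch below is an illustration under assumed names and parameter values, and it says nothing about the architecture described in [87]. Each trial is independent of the others, which is what makes the structure naturally parallel.

```python
import random


def fit_line_by_guessing(points, trials=200, tolerance=0.5, seed=0):
    """Return (line, inliers) for the guess that the most points agree with.

    The line is (a, b, c) with a*x + b*y + c = 0 and a*a + b*b = 1.
    """
    rng = random.Random(seed)  # seeded so that runs are repeatable
    best_line, best_inliers = None, []
    for _ in range(trials):
        # Guess: the line through two randomly chosen points.
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        a, b = y2 - y1, x1 - x2
        norm = (a * a + b * b) ** 0.5
        if norm == 0.0:
            continue  # the two samples coincide; no line is defined
        c = -(a * x1 + b * y1)
        # Check: count the points lying within `tolerance` of the guess.
        inliers = [p for p in points
                   if abs(a * p[0] + b * p[1] + c) / norm <= tolerance]
        if len(inliers) > len(best_inliers):
            best_line, best_inliers = (a / norm, b / norm, c / norm), inliers
    return best_line, best_inliers
```

On parallel hardware the loop body could be run once per processor, with a final reduction selecting the guess that has the largest agreeing set.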